Supplemental Information for Clatskanie Manuscript

Author

Brandon Sloan

Published

03 May, 2024

1 Clatskanie Soil and Plant Trait Data

This notebook provides the analysis and figures used in the manuscript: Which plant traits drive soil carbon sequestration? Empirical evidence from a long‑term Populus genetic diversity trial. We explore fractionated soil core data taken at 69 trees covering 24 genotypes at the Clatskanie Common Garden site. Below, I will briefly describe the data followed by analyses corresponding to the differing sections of the manuscript.

I have harmonized multiple datasets collected at the Clatskanie common garden site since 2009. Primarily, this analysis relies on soil and root chemistry data taken as part of Melanie Mayes’ SEED project in the summer of 2022, while the root exudate and architecture data were taken as part of John Field’s LDRD project in fall of 2022. The aboveground biomass data were provided by Mirko Pavicic-Venegas from the overall CBI database. Full details on these harmonized datasets are given in 01-harmonize-clatskanie-data.html.

Code
d_path <- "02-data/"
df_fit <-
  read_csv(file.path(d_path, "02-processed/clatskanie-c-fit-data.csv")) |>
  mutate(block = as.character(block),
         across(where(is.character), \(x) fct(x)),
         SiCl = Silt + Clay)

# Clatskanie latitude and longitude data from Stan Martins
df_ll <-
  read_csv(file.path(d_path, "01-raw/03-seed/clatskanie-lat-lon.csv")) |>
  mutate(
    genotype = as.factor(CBI_genotype),
    block = as.factor(CBI_block),
    tree = fct_cross(genotype, block, sep = "-")
  ) |>
  select(tree, latitude, longitude)
Table 1: Clatskanie common garden 2022 soil sampling data.

ord tree genotype block Depth latitude longitude Row Column Trait BD MCwetBD MCdryBD MCwetCHEM MCdryCHEM pH Sand Clay Silt POMpN POMpC MAOMpN MAOMpC POMN POMC MAOMN MAOMC TotpN TotpC TotC pCa pK pMg pP pC pN pS Lig Alppm Bppm Crppm Cuppm Feppm Mnppm Nappm Nippm Znppm root_CN agb cop_id height TotC_st MAOMC_st POMC_st pMAOM pPOM MAOM_POM TotC_agb MAOM_CN POM_CN soil_CN C_chk C_chk_abs C_chk_rel C_chk_id TotC_st_30 TotC_30 SiCl
1 BESC-102-1 BESC-102 1 15 46.12150 -123.2688 18 8 HF 0.8840147 0.3735775 0.5963666 0.420 0.7241379 5.106 0.000000 45.90000 55.75000 0.8930000 19.61 0.313 4.05 0.53 11.560000 3.05 39.52000 0.406 6.040000 60.40000 0.71988 0.43521 0.15765 0.07964 47.489 0.955 0.08403 26.58 7378.17 16.350 73.565 15.167 6213.76 152.445 285.622 7.073 55.162 49.72670 5.374592 Coppiced 930 80.09173 52.40439 15.328814 65.43046 19.13907 3.418685 11.238062 12.95738 21.81132 14.87685 15.430464 -9.320000 -15.430464 Good 99.32942 44.21547 101.65000
2 BESC-102-2 BESC-102 2 15 46.12114 -123.2706 61 21 HF 0.6193467 0.4093323 0.6929993 0.434 0.7667845 5.089 7.850000 45.15000 47.00000 0.8050635 16.53 0.298 3.40 0.40 9.365594 2.83 52.97187 0.526 7.636667 76.36667 0.81550 0.36465 0.12897 0.10552 51.277 1.017 0.11408 32.71 5044.21 11.643 54.669 21.753 3243.34 283.393 307.614 9.595 73.663 50.41986 3.864154 Non-coppiced 1000 70.94617 49.21193 8.700825 69.36517 12.26398 5.656008 19.762841 18.71798 23.41398 14.51838 18.370846 -14.029202 -18.370846 Good 132.22369 70.85999 92.15000
3 BESC-102-3 BESC-102 3 15 46.12107 -123.2718 92 22 HF 0.7910838 0.3241181 0.4795485 0.352 0.5432099 5.700 8.390796 38.20600 51.20320 0.6419370 13.30 0.399 5.64 0.37 7.590000 3.68 51.99000 0.385 5.350000 53.50000 0.75848 0.61040 0.19529 0.14161 44.073 1.107 0.10951 24.03 8403.85 16.980 256.507 23.006 8770.00 165.941 335.891 15.061 51.250 39.81301 8.027529 Non-coppiced 710 63.48447 61.69267 9.006489 97.17757 14.18692 6.849802 6.664567 14.12772 20.51351 13.89610 11.364486 6.080000 11.364486 Good 104.23235 42.84569 89.40920
4 BESC-119-1 BESC-119 1 15 46.12163 -123.2685 9 4 HSG 0.8169904 0.3625429 0.5687329 0.346 0.5290520 5.706 0.000000 37.90262 62.87141 0.5610000 11.57 0.280 3.39 0.34 7.070000 2.65 32.04000 0.332 4.243181 42.43181 0.90493 0.71556 0.09802 0.11765 51.781 0.997 0.11546 35.71 2109.76 11.844 14.033 24.694 1833.12 149.316 107.555 4.876 57.833 51.93681 3.344784 Coppiced 720 51.99957 39.26456 8.664183 75.50938 16.66203 4.531825 12.685966 12.09057 20.79412 12.78067 7.828594 -3.321814 -7.828594 Good 72.03720 29.87678 100.77403
5 BESC-131-1 BESC-131 1 15 46.12119 -123.2684 6 20 LSG 0.7943094 0.3892863 0.6374284 0.366 0.5772871 5.454 3.883255 38.78308 57.33366 0.6490000 12.98 0.366 4.99 0.39 7.850000 3.54 48.31000 0.446 6.785570 67.85570 0.64087 0.33447 0.06985 0.08973 53.107 0.939 0.10276 25.55 2443.64 10.979 53.096 28.368 2629.39 170.083 129.500 7.140 49.091 56.55698 2.228805 Coppiced 730 80.84763 57.55963 9.352993 71.19520 11.56867 6.154140 30.444874 13.64689 20.12821 15.21428 17.236135 -11.695700 -17.236135 Good 156.24599 67.34213 96.11674
6 BESC-131-2 BESC-131 2 15 46.12087 -123.2704 57 31 LSG 0.7018199 0.4175004 0.7167392 0.400 0.6666667 5.579 7.180124 41.88820 50.93168 0.8160000 18.11 0.398 5.71 0.41 9.150000 3.83 54.95000 0.497 7.552126 75.52126 0.63574 0.34650 0.10121 0.09406 52.093 0.977 0.10809 37.24 2502.82 9.928 19.896 39.533 1375.17 149.925 156.146 8.133 72.110 53.31934 6.980348 Non-coppiced 1290 79.50348 57.84751 9.632478 72.76097 12.11579 6.005464 10.819124 14.34726 22.31707 15.19542 15.123234 -11.421257 -15.123234 Good 120.22660 65.37997 92.81988

1.1 Soil C Fractionation

The first figure of the manuscript compares MAOM and POM soil C (Figure 2), where MAOM soil C dominates over POM C at the Clatskanie site. There may also be a thresholding behavior around a MAOM value of 60 mg C/g soil (Figure 2 b), but this could simply be noise in the data. Overall, changes in MAOM may be more indicative of total C at this site.

Code
g1 <- df_fit |>
  pivot_longer(c(MAOMC, POMC, TotC)) |>
  mutate(name  = fct_recode(
    name,
    MAOM = "MAOMC",
    POM = "POMC",
    Total = "TotC"
  )) |>
  ggplot(aes(y = value,
             x = name)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(size = 1, shape = 4) +
  ylab("0-15 cm Soil C \n [mg C/g soil]") +
  xlab("Soil C Pool") +
  theme_cowplot(fs)

g1

Figure 1: Comparison of MAOM, POM and Total soil carbon in concentration units.
Code
g1 <- df_fit |>
  pivot_longer(c(MAOMC, POMC, TotC)) |>
  mutate(name  = fct_recode(
    name,
    MAOM = "MAOMC",
    POM = "POMC",
    Total = "TotC"
  )) |>
  ggplot(aes(y = value,
             x = name)) +
  geom_boxplot() +
  ylab("0-15 cm Soil C \n [mg C/g soil]") +
  xlab("Soil C Pool") +
  theme_cowplot(fs)


g2 <- df_fit |>
  ggplot(aes(y = MAOMC,
             x = POMC)) +
  geom_point() +
  geom_smooth(method = "gam") +
  #stat_poly_eq(use_label(c("eq", "adj.R2"))) +
  ylab("0-15 cm MAOM Soil C \n [mg C/g soil]") +
  xlab("0-15 cm POM Soil C \n [mg C/g soil]") +
  theme_cowplot(fs)

g3 <- g1 + g2 + plot_annotation(tag_levels = "a", tag_suffix = ')')
g3

Figure 2: Comparison of a) MAOM, POM, and total soil C and b) MAOM versus POM soil C, in concentration units.

Here and throughout the analysis, I look at soil C results with respect to both concentration (mg C/g soil) and stock (tonnes C/ha). In the manuscript, we focus on concentration units because the stock units may be confounded by two bulk density uncertainties. First, the cores used to measure bulk density are different from the cores from which the concentration was extracted. Second, bulk density decreases as soil C increases, since less dense plant material fills a fixed volume. Therefore, stock measurements from a fixed core would need to adjust for this change in density. At Clatskanie, this adjustment is difficult given there are no measurements from the start of the experiment.
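For reference, the concentration-to-stock conversion itself is straightforward, which is why the bulk density uncertainty propagates directly into the stock estimates. A minimal sketch (not the manuscript's exact code; the 0.1 factor converts mg C/g × g/cm³ × cm to tonnes C/ha):

```r
# Sketch of the concentration-to-stock conversion:
# stock [tonnes C/ha] = concentration [mg C/g soil] * bulk density [g/cm^3]
#                       * depth [cm] * 0.1
conc_to_stock <- function(conc, bd, depth_cm = 15) {
  conc * bd * depth_cm * 0.1
}

# e.g., 50 mg C/g soil at BD = 0.8 g/cm^3 over 0-15 cm -> 60 tonnes C/ha
conc_to_stock(50, 0.8, 15)
```

Because bulk density enters multiplicatively, any error in BD scales the stock estimate one-for-one.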

Nevertheless, stock units are typically more useful for carbon sequestration and land management studies, since those studies concern how much carbon is stored per unit of land. Figure 3 shows the same MAOM-POM comparison as Figure 2, in stock units.

Code
g1 <- df_fit |>
  pivot_longer(c(MAOMC_st, POMC_st, TotC_st)) |>
  mutate(name  = fct_recode(
    name,
    MAOM = "MAOMC_st",
    POM = "POMC_st",
    Total = "TotC_st"
  )) |>
  ggplot(aes(y = value,
             x = name)) +
  geom_boxplot() +
  ylab("0-15 cm Soil C \n [Tonnes C/ha]") +
  xlab("Soil C Pool") +
  theme_cowplot(fs)


g2 <- df_fit |>
  ggplot(aes(y = MAOMC_st,
             x = POMC_st)) +
  geom_point() +
  geom_smooth(method = "gam") +
  #stat_poly_eq(use_label(c("eq", "adj.R2"))) +
  ylab("0-15 cm MAOM Soil C \n [Tonnes C/ha]") +
  xlab("0-15 cm POM Soil C \n [Tonnes C/ha]") +
  theme_cowplot(fs)

g1 + g2 + plot_annotation(tag_levels = "a", tag_suffix = ')')

Figure 3: Comparison of a) MAOM, POM, and total soil C and b) MAOM versus POM soil C, in stock units.
Code
# Genotype average
df_fit |> group_by(genotype) |> 
  summarize(across(.cols = c(MAOMC,POMC), .fns = list(mn = mean, sd = sd))) |> 
  ggplot(aes(y = MAOMC_mn,
             x = POMC_mn)) +
  geom_point() +
  geom_linerange(aes(xmin = POMC_mn - POMC_sd, xmax = POMC_mn + POMC_sd)) +
  geom_linerange(aes(ymin = MAOMC_mn - MAOMC_sd, ymax = MAOMC_mn + MAOMC_sd)) +
  geom_smooth(method = "lm") +
  stat_poly_eq(use_label(c("eq", "adj.R2"))) +
  ylab("0-15 cm MAOM Soil C \n [mg C/g soil]") +
  xlab("0-15 cm POM Soil C \n [mg C/g soil]") +
  theme_cowplot(fs)

Figure 4: Genotype average MAOM and POM comparison in concentration units.

2 Spatial Confounding

2.1 Fit Thin-plate Spline (TPS)

The rich Clatskanie data set has one glaring flaw—there was no initial site soil characterization. Therefore, we do not know whether changes in soil C (and other phenotypes) are due to the poplar trees or simply due to soil texture, nutrients, drainage, and other location-specific controls. If we analyze the data as is for changes in soil C, we implicitly assume that all locations started at the same soil C (the site average) in 2009 and that any observed deviations in 2022 are due to the poplar trees. An alternative approach uses a spatial model (e.g., a thin-plate spline) to remove the spatial variability in the soil C and other phenotype observations (Kristy et al. 2022). Here, we use the R package fields to fit a thin-plate spline to our variables of interest, keeping the model residuals as the adjusted values that represent tree-level effects. However, this method may be inadequate, leading to unreliable downstream analyses, a concern we address in the next section. First, we cover the fitting process.

Code
# Set variables to spatially adjust (sa)
x_name <- "longitude"
y_name <- "latitude"
sa_vars <-
  c(
    "TotC",
    "TotC_30",
    "POMC",
    "MAOMC",
    "TotC_st",
    "TotC_st_30",
    "POMC_st",
    "MAOMC_st",
    "MAOM_POM",
    "soil_CN",
    "agb",
    "height",
    "root_CN",
    "Lig",
    "pH",
    "pCa",
    "pMg",
    "pK",
    "pP",
    "pS",
    "Clay",
    "Sand",
    "Silt",
    "SiCl",
    "BD",
    grep("ppm",colnames(df_fit), value = TRUE)
  )


# Function for running spline
run_tps <- function(df, x_name, y_name, z_name, method = "GCV") {
  
  # Create matrix of spatial coordinates
  s <- df[, c(x_name, y_name)]
  z <- df |> pluck(z_name)
  
  # Fit thin-plate spline and kriging
  fit_tps <- Tps(s, z, lon.lat = TRUE, method = method)
  # fit_tps <-
  #   spatialProcess(s, z, profileLambda = TRUE, profileARange = TRUE)
  
  # Predict plotting surface
  surf_tps <- predictSurface(fit_tps)[c("x", "y", "z")]
  df_surf <-
    expand_grid(latitude = surf_tps$y, longitude = surf_tps$x) |>
    mutate(z_tps = as.vector(surf_tps$z))
  
  # Estimates and adjusted values
  z_adj <- df |> select(tree) |>
    mutate(adj = as.vector(fit_tps$residuals))
  
  # Check normality and variance explained
  p_nrm <- shapiro.test(fit_tps$residuals)$p.value
  var_exp <- 1 - var(fit_tps$residuals) / var(z)
  df_eff <- fit_tps$eff.df
  
  return(list(
    surf = df_surf,
    adj = z_adj,
    var_exp = as.vector(var_exp),
    norm_test  = p_nrm,
    df_eff = df_eff
  ))
  
}

# Spatially adjust variables
sa_methods <- c("REML", "GCV")
sa_runs <- expand_grid(sa_vars,sa_methods)

sa_results <-
  sa_runs |>
  mutate(sa = map2(
    sa_vars,
    sa_methods,
    \(x, y) run_tps(df_fit, x_name, y_name, x, method = y)
  )) |>
  unnest_wider(sa) |>
  rename(name = sa_vars,
         method = sa_methods) |> 
  mutate(across(name:method, fct))

Figure 5 shows the fit spatial fields at Clatskanie, where the circles represent the observation locations. The size of the circles is proportional to the absolute value of the residual, or left-over effect after spatial variation is removed. The soil C variables do appear to have a strong spatial pattern, with high values occurring in the middle of the site (i.e., block 2) and lower values in the east and west (i.e., blocks 1 and 3, respectively). However, other variables only have slight gradients (e.g., pCa, pMg, Nappm). In the next section, we will take a closer look at these spline fits and discuss potential limitations.

Code
# A tibble of adjusted values
df_sa <- sa_results|>
  unnest(adj) |>
  select(name, method, tree, adj, var_exp, norm_test) |>
  left_join(df_ll, by = "tree")

# Summary statistics
df_sa_stats <- sa_results |>
  select(name, method, norm_test, var_exp, df_eff) |>
  mutate(df_eff = df_eff / nrow(df_fit)) |>
  left_join(df_fit |>
              summarize(across(all_of(sa_vars), \(x) mad(x) / median(x))) |>
              pivot_longer(cols = everything(), values_to = "CV"),
            by = "name")


# Only show spatial plot for REML
sa_results |>
  filter(method %in% "REML") |>
  unnest(surf) |>
  group_by(name) |>
  mutate(across(.cols = z_tps, .fns = scale)) |>
  ggplot(aes(
    x = longitude,
    y = latitude,
    z = z_tps,
    fill = z_tps
  )) +
  geom_tile() +
  geom_contour(color = "white") +
  geom_point(
    data = df_sa |>
      filter(method %in% "REML") |>
      group_by(name) |>
      mutate(across(.cols = adj, .fns = scale)),
    aes(x = longitude, y = latitude, size = abs(adj)),
    color = "black",
    # size = 1,
    shape = 21,
    inherit.aes = FALSE
  ) +
  xlab("Longitude") +
  ylab("Latitude") +
  coord_fixed() +
  scale_fill_distiller(
    type = "div",
    palette = "RdYlGn",
    name = "Std. Dev.",
    limits = c(-3, 3)
  ) +
  scale_size_continuous(limits = c(0, 2),
                        name = "Abs. Std. \n Residual") +
  facet_wrap(~ name, dir = "v", ncol = 3) +
  force_panelsizes(rows = unit(1, "in"),
                   cols = unit(2.2, "in")) +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

# Clear heavy file
rm(sa_results)

Figure 5: Thin-plate spline fits using the restricted maximum likelihood criteria for all outcomes of interest. The circle size corresponds to the absolute standardized residual from the TPS surface. The colors and contours correspond to the standardized TPS predictions.

2.2 Potential Limitations of TPS Spatial Adjustment

Although spatial corrections appear to be common practice for extracting genotype effects from common garden experiments (Evans et al. 2014), the fits are not typically addressed in the supplements of such works (some authors do check normality of residuals). I question whether the model removes only the spatial variation in the data or removes both spatial and genotypic variability. The logic is that since the Clatskanie experiment is a block-randomized design, a coherent spatial pattern would not reflect the genotype effects, but would be the tree-level environmental interaction. However, I find several issues with this logic, which I briefly discuss below:

  1. There is no way to know whether the background variability is actually removed. An extremely flexible TPS model can simply fit every point well, removing most of the variability. The generalized cross-validation (GCV) approach employed by fields to tune the smoothing parameter should combat this over-fitting. However, the limited number of samples and their clustering make it very possible that trees next to each other could have similar genotypic effects. In fact, Wood (2017) explicitly mentions that under small sample sizes, restricted maximum likelihood (REML) estimates may be more efficient than GCV. However, I suspect that both could be subject to spatial confounding (Dupont, Wood, and Augustin 2022a) and may be unreliable. In summary, I do not think Tps or other spatial methods guarantee that only spatial variability is removed.

  2. Spatial model errors will bias downstream results. The spatial model will be influenced by the clustering in samples, large data values, and variability and errors in the observations. A poor spatial model fit will contaminate the analysis with pseudo-data that has no real information about the process we are interested in. Additionally, the risk of this will increase with smaller effect sizes, greater observation noise, and less variability in the data. In summary, this spatial adjustment approach requires that we fit two models extremely well—the spatial model and the downstream regressions.

  3. In the case of Clatskanie, we have spatial observations of soil texture, a block factor, and coppicing, which likely account for much of the site-specific variability (i.e., abiotic or non-tree induced variability). Therefore, it may be simpler to use these as covariates in our regression model than to spatially correct certain variables with their own individual models prone to the errors described above.
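The first concern can be probed with a toy example (a hypothetical check, not part of the manuscript analysis): fit a TPS to pure noise at scattered locations and see how much "spatial" variance the fitted surface claims. Any variance explained here is spurious by construction.

```r
# Hypothetical over-fitting check: z has no spatial signal, so any variance
# "explained" by the spline is an artifact of model flexibility.
library(fields)
set.seed(42)
n <- 69                              # mimic the 69-tree sample size
s <- cbind(runif(n), runif(n))       # scattered x-y locations
z <- rnorm(n)                        # pure noise

fit_gcv  <- Tps(s, z, method = "GCV")
fit_reml <- Tps(s, z, method = "REML")

# Effective degrees of freedom: a flexible fit (high eff.df) on pure noise
# is over-fitting by definition
c(GCV = fit_gcv$eff.df, REML = fit_reml$eff.df)

# Apparent "variance explained" on data with zero spatial structure
1 - c(GCV  = var(fit_gcv$residuals),
      REML = var(fit_reml$residuals)) / var(z)
```

Comparing the two tuning criteria on simulations like this could indicate how much of the variance explained in Figure 6 a) might be spurious.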

To determine how well the TPS model performs, I have calculated a few summary statistics for each variable of interest (Figure 6). Figure 6 a) shows the variance explained by the TPS model, with Total C, height, MAOMC, and Clay having roughly 75% of their variance explained spatially. Notably, they all use over 30% of the possible degrees of freedom, indicating a very flexible model. However, when Clay uses REML rather than GCV, the variance explained plummets, as does the model flexibility. This is clearly a case of GCV over-fitting a small sample. Regardless, Total C, MAOM C, and height have much of their variability explained regardless of the fit method. The question remains: is that variance explained purely spatial? And would using soil texture, coppicing, and a block effect explain a similar amount of variability?

Code
g1 <- df_sa_stats |>
  mutate(name = fct_reorder(name, var_exp)) |>
  pivot_longer(cols = c(var_exp, CV, norm_test, df_eff),
               names_to = "stat") |>
  mutate(
    stat = fct_recode(
      fct(stat, levels = c("var_exp", "df_eff", "CV", "norm_test")),
      `Variance Explained` = "var_exp",
      `Norm. Effective df` = "df_eff",
      `Coefficient of Variation` = "CV",
      `Norm. p-value` = "norm_test"
    )
  ) |>
  ggplot(aes(y = name,
             x = value,
             fill = method)) +
  geom_col(position = "dodge", color = "black") +
  geom_vline(
    data = tibble(stat = fct(c("Norm. p-value", "Variance Explained")),
                  thresh = c(0.05,0.1)),
    aes(xintercept = thresh),
    colour = "red",
    linetype = "dashed"
  ) +
  coord_cartesian(xlim = c(0, 1), clip = "off") +
  xlab("Statistic Value") +
  ylab("phenotype") +
  facet_wrap(~ stat, nrow = 1) +
  force_panelsizes(rows = unit(5.75, "in"),
                   cols = unit(1.8, "in")
                   ) +
  theme_cowplot(fs) +
  theme(legend.position = "bottom", legend.justification = "center")
g1 <- tag_facet2(g1, open = "", vjust = -2, hjust = 1)
g1
rm(g1)

Figure 6: Thin-plate spline (TPS) fit results using the restricted maximum likelihood (REML) and generalized cross-validation (GCV) procedures: a) variance explained by the TPS, b) the effective degrees of freedom normalized by the number of observations, c) the coefficient of variation of each predictor, and d) the p-value from the Shapiro-Wilk normality test performed on the TPS residuals. Note that high values in panel b) indicate an extremely flexible model, and values below the red dashed line in panel d) indicate non-normal residuals.

All other phenotypes have much lower variance explained spatially and less model flexibility. Notably, nearly 2/3 of the models fail the residual normality test (Figure 6 d), which may or may not be important. The non-normality could indicate outliers that influence the surface fit, or multiple zero values as in the case of sand. It could also indicate bias in the surface. Gelman, Hill, and Vehtari (n.d.) do not recommend examining qq-plots and residual normality, as normality has little influence on the fit and matters more for prediction. However, here we briefly examine the qq-plots (Figure 7) and density estimates of the residuals (Figure 8). We see that many of the outcomes with non-normal residuals are positively skewed (i.e., have large positive residuals), which indicates that the fitted surface under-estimates these large values.

The question is: what do I do with this information? Many of the outcomes whose residuals are skewed also have skewed distributions. Although the two are not necessarily related, a log transformation could help compress the data. However, this does not make much sense given that I am fitting a thin-plate spline that already allows for non-linearity. Overall, it would appear these outcomes are not well explained spatially, and thus a correction should not be performed. However, this leaves the practical problem of deciding which variables to spatially adjust.

Code
df_sa |>
  #filter(name %in% c("MAOMC", "Clay", "agb", "Alppm")) |>
  group_by(name, method) |> 
  mutate(adj_scale = scale(adj,scale = TRUE),
         sig = ifelse(norm_test < 0.05, "Problem", "OK"),
         vjust = ifelse(method %in% "REML", 1, 0)) |> 
  ungroup() |> 
  ggplot(aes(sample = adj_scale, colour = method)) +
  stat_qq() +
  stat_qq_line() + 
  geom_text(aes(label = sig, color = method, vjust = vjust), x = -1.25, y = 6) + 
  xlab("Standard Normal Quantiles") +
  ylab("Outcome Quantiles") +
  facet_wrap(~name) +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

Figure 7: Quantile-quantile plots of the TPS residuals using two different fitting methods.
Code
df_sa |>
  #filter(name %in% c("MAOMC", "Clay", "agb", "Alppm")) |>
  group_by(name, method) |>
  mutate(
    adj_scale = scale(adj, scale = TRUE),
    sig = ifelse(norm_test < 0.05, "Problem", "OK"),
    vjust = ifelse(method %in% "REML", 1, 0)
  ) |>
  group_by(name, method) |>
  mutate(adj_scale = scale(adj, scale = TRUE)) |>
  ungroup() |>
  ggplot(aes(x = adj_scale, colour = method)) +
  geom_density() +
  stat_function(
    fun = dnorm,
    n = 101,
    args = list(mean = 0, sd = 1),
    color = "black",
    linetype = "dashed"
  ) +
  geom_text(aes(
    label = sig,
    color = method,
    vjust = vjust
  ),
  x = -1.25,
  y = 0.6) +
  xlab("Normalized Values") +
  ylab("Kernel Density") +
  facet_wrap(~ name) +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

Figure 8: Kernel density estimate of normalized residuals, with the standard normal distribution shown as a dashed black line.

For consistency, I assumed that variables that have less than 10% of their variability explained by the TPS should not be spatially adjusted (i.e., outcomes left of the red dashed line in Figure 6 a)). Additionally, I do not adjust Sand and Clay for later analyses as these would likely not be influenced by the presence of trees, unlike other soil properties, such as bulk density and pH. Therefore, I have three total data sets that will be carried through in the downstream regression: 1) raw, 2) TPS adjusted using GCV, and 3) TPS adjusted using REML. Overall, I trust the raw data the most, followed by REML and lastly the GCV given its over-fitting issues for small samples. The hope with using all three data sets is that we see similar relative results for heritability and drivers of soil C.

Code
# Update df_fit variables to spatially adjusted values 
df_fit_sa_reml <- df_fit |>
  rows_update(
    df_sa |>
      filter(method %in% "REML", var_exp >= 0.1) |>
      select(-method,-var_exp, -norm_test) |>
      pivot_wider(values_from = adj,
                  names_from = name) |>
      select(-Sand,-Clay,-SiCl,-Silt),
    by = "tree"
  )

df_fit_sa_gcv <- df_fit |>
  rows_update(
    df_sa |>
      filter(method %in% "GCV", var_exp >= 0.1) |>
      select(-method, -var_exp, -norm_test) |>
      pivot_wider(values_from = adj,
                  names_from = name) |>
      select(-Sand,-Clay,-SiCl,-Silt),
    by = "tree"
  )
Code
library(mgcv)
k_sp = 30

# Change to sum contrast coding - difference from grand mean
options(contrasts = rep("contr.sum", 2))

# Splits for cross-validation
set.seed(1)
v_folds <- 10
v_rep <- 5
make_df_cv <- \(x) vfold_cv(x,
                            v = v_folds,
                            repeats = v_rep)

df_cv <- make_df_cv(df_fit)

get_adj <- function(mdl) {
  mdl <- mdl |> extract_fit_engine()
  tibble(adj = list(mdl$residuals),
       r2_train = list(mdl |> summary() |> pluck("r.sq")))
}

# Create new cv sets
tidy_ctrl <- control_grid(extract = get_adj)


spec_gam <- gen_additive_mod() %>% 
  set_engine("mgcv") %>% 
  set_mode("regression")

fit_gam <- workflow() %>% 
  add_model(spec_gam, formula = MAOMC ~ s(latitude, longitude)) |> 
  add_formula(MAOMC ~ latitude + longitude) |> 
  fit(data = df_fit) |> 
  extract_fit_engine()

fit_gam <- workflow() %>% 
  add_model(spec_gam, formula = MAOMC ~ s(latitude, longitude)) |> 
  add_formula(MAOMC ~ latitude + longitude) |> 
  fit_resamples(resamples = df_cv, control = tidy_ctrl)


fit_gam |> unnest(.extracts) |> unnest(.extracts) |> unnest(adj)

fit_gam |> collect_metrics()
# Predictors
predictors_sub <-
  c(
    "agb",
    "height",
    "root_CN",
    "Lig",
    "pK",
    "pCa",
    "pMg",
    "pP",
    "pS",
    "Clay",
    "Sand",
    "pH",
    "block",
    "cop_id"
  )

predictors_all <-
  c(predictors_sub,
    grep("ppm", colnames(df_fit), value = TRUE))



# Create multiple recipes with differing outcomes
update_outcome <- function(rec, old_oc, new_oc) {
  new_rec <- rec |>
    update_role(all_of(old_oc), new_role = "ID") |>
    update_role(all_of(new_oc), new_role = "outcome")
}

outcomes <- c("MAOMC","agb", "Alppm")


# Create a workflow for the linear model
base_rcp <- 
  recipe(TotC ~ ., data = df_fit) |> 
  update_role(-all_of(c(predictors_all,"TotC")), new_role = "ID") |> 
  step_dummy(all_nominal_predictors()) |> 
  step_normalize(all_predictors()) |> 
  step_zv(all_predictors())

no_metal_rcp <- base_rcp |> 
  update_role(-all_of(c(predictors_sub,"TotC")), new_role = "ID") 

# Create recipe for each outcome
rcps <-
  expand_grid(outcome = outcomes,
              rcp_type = list(all = base_rcp, no_metal = no_metal_rcp)) |>
  mutate(
    rcps = map2(outcome, rcp_type, \(x, y) update_outcome(y, "TotC", x)),
    rcp_id = names(rcp_type)
  ) |>
  select(-rcp_type)

# GLMNET model spec
gn_specs = list(
  ridge = linear_reg(penalty = tune(), mixture = 0) |>
    set_engine("glmnet") |>
    set_mode("regression"),
  lasso =
    linear_reg(penalty = tune(), mixture = 1) |>
    set_engine("glmnet") |>
    set_mode("regression")
)

# Grid of GLMNET penalty
gn_grid <- grid_regular(penalty(range = c(-2, 2)),
                        #mixture(range = c(0, 1)),
                        levels = 50)

# Set up parallel cores
all_cores <- parallel::detectCores(logical = TRUE)
cl <- makePSOCKcluster(all_cores-1)
registerDoParallel(cl)

# Workflows
reg_results <- expand_grid(rcps, gn_specs, df_cv) |>
  mutate(
    spec_id = names(gn_specs),
    data_id = names(df_cv),
    wfs = map2(rcps, gn_specs, \(x, y) workflow(
      preprocessor = x, spec = y
    )),
    run = paste(outcome, data_id, spec_id, sep = "-"),
    across(where(is.character), fct),
    tuned_wfs = map2(wfs, df_cv,
                     \(x, y) tune_grid(x, resamples = y, grid = gn_grid))
  ) |> 
  select(-df_cv)

# Unnest regularization curves   
reg_curves <- reg_results |>
  select(where(is.factor),tuned_wfs) |>
  mutate(
    all_perf = map(tuned_wfs, \(x) collect_metrics(x)),
    best_perf = map(
      tuned_wfs,
      \(x) select_by_pct_loss(x, metric = "rmse", penalty, limit = 1)
    )
  ) |>
  select(-tuned_wfs)

# Create final fits
fits_final <- reg_results |>
  left_join(enframe(dfs, name = "data_id", value = "df"), by = "data_id") |>
  mutate(
    wfs_final = map2(wfs,
                     tuned_wfs,
                     \(x, y) finalize_workflow(
                       x,
                       select_by_pct_loss(y,
                                          metric = "rmse",
                                          penalty,
                                          limit = 1)
                     )),
    fit_final = future_map2(wfs_final, df, \(x, y) fit(x, data = y)),
    prms = future_map(fit_final, \(x) tidy(x))
  ) |> 
  select(outcome, rcp_id, spec_id, data_id, run, wfs_final, fit_final, prms)
 
rm(reg_results)


# mdl <- gam(TotC~s(latitude, k = k_sp) + s(longitude, k = k_sp) + ti(latitude,longitude, bs = "tp", k = k_sp) + s(Sand, k = k_sp) + cop_id, data = df_fit, method = "REML")
mdl <- gam(TotC~s(latitude, longitude, k = k_sp), data = df_fit, method = "GCV.Cp", family = scat())
mdl <- gam(TotC~s(Sand) + s(Clay), data = df_fit, method = "GCV.Cp")

summary(mdl)
shapiro.test(mdl$residuals)
gam.check(mdl)
plot(mdl)

# Create matrix of spatial coordinates
s <- df_fit[, c("latitude", "longitude")]
z <- df_fit |> pluck("TotC")

# Fit thin-plate spline and kriging
fit_tps <- Tps(s, z, lon.lat = TRUE, method = "REML")
shapiro.test(fit_tps$residuals)
fit_tps$eff.df
df_fit |> ggplot(aes(x = TotC)) + geom_density()

2.3 Questions for Mirko and Hari

  1. Does the uneven sampling (Figure 5), small sample size, and larger spatial scale at Clatskanie invalidate the spatial correction? Yes, this likely has an effect and possibly invalidates the spatial correction. Typically, these analyses are done on denser measurement sets. A possible check would be to regress phenotypes against latitude of origin to determine if there is any signal. Then regress those residuals using TPS to see if the large spatial explanations still exist.

  2. Do you often check the effective degrees of freedom to verify that the model is not memorizing the data? It does not seem like this is common practice, but it is important. Rashon mentioned that fields may allow for greater control over k-fold cross-validation.

  3. Have you ever tested with a toy example whether the TPS method is able to accurately remove only spatial variability? They have not, although Mirko did check something similar with leaf color to ensure that a sparsely sampled area was predicting well. Rashon also shared (Harman-Ware et al. 2021), where they used an ad hoc surface complexity statistic to decide whether the TPS fit was necessary. A good follow-up methods paper could dig into the validity of these correction methods for sparse, noisy datasets.

  4. Do you often run into non-normal residuals? Do you perform variable transformations or just live with the result? Yes, they often run into them, but deem them unimportant for model fit. Also, Rashon suggested that it is the final downstream residuals that should be normal, not those from the intermediate TPS correction. However, non-normality could indicate influential points (large residuals) that will bias the fit. I suppose this is a disadvantage of this two-stage regression approach.

  5. How do you select which variables are spatially corrected? Hari mentioned only correcting variables that have increased heritability with the correction, but that assumes that the spatial correction is working appropriately. I think the approach laid out in (Harman-Ware et al. 2021) with the SC metric may be good for future work. However, here I do not think we will use spatial correction.
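The check proposed in question 1 could be sketched as follows. This is hypothetical: `df_chk` and `lat_origin` (the genotype latitude of origin) are assumed inputs, not part of the harmonized dataset.

```r
# Hypothetical two-stage check: remove any latitude-of-origin signal first,
# then see whether a TPS still claims large spatial variance in what remains.
library(fields)

# df_chk is assumed to hold tree coordinates, a phenotype (MAOMC), and the
# genotype latitude of origin (lat_origin)
fit_lat <- lm(MAOMC ~ lat_origin, data = df_chk)
fit_tps <- Tps(df_chk[, c("longitude", "latitude")],
               resid(fit_lat), lon.lat = TRUE, method = "REML")

# Share of the origin-adjusted variability the spatial surface still explains
1 - var(fit_tps$residuals) / var(resid(fit_lat))
```

If the spatial surface still explains a large share after the origin adjustment, the pattern is less likely to be a genotypic artifact.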

3 Heritability Analysis

3.1 Fit Linear Mixed Effects Model

Based on the recommendation of Hari Chetri, we are going to quantify the heritability of each phenotype to determine whether there are any biological controls. Broad-sense heritability is simply the fraction of phenotypic variance explained by genotype (\(H^2 = \frac{\sigma^2_G}{\sigma^2_P}\)). We will use a linear mixed effects model with genotype as a random effect to quantify heritability.
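A minimal self-contained sketch of this calculation on simulated data (not the manuscript's fitting code):

```r
# Broad-sense heritability from a genotype-only random-intercept model:
# H2 = sigma^2_genotype / (sigma^2_genotype + sigma^2_residual)
library(lme4)
set.seed(1)
geno  <- rep(paste0("G", 1:24), each = 3)           # 24 genotypes, 3 reps each
g_eff <- rnorm(24, sd = 2)                          # true genotype effects
y     <- g_eff[as.integer(factor(geno))] + rnorm(72)

mdl <- lmer(y ~ (1 | geno))
vc  <- as.data.frame(VarCorr(mdl))                  # variance components
H2  <- vc$vcov[vc$grp == "geno"] / sum(vc$vcov)
H2
```

The helper `calc_H2` below follows the same logic but extracts the components with broom's `tidy` and uses the raw residual variance in the denominator.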

Code
# Outcome variables and labels for plotting
outcome <-
  c(
    "TotC",
    "TotC_30",
    "POMC",
    "MAOMC",
    "TotC_st",
    "TotC_st_30",
    "POMC_st",
    "MAOMC_st",
    "MAOM_POM",
    "soil_CN",
    "BD",
    "agb",
    "height",
    "root_CN",
    "Lig",
    "pH",
    "pCa",
    "pMg",
    "pK",
    "pP",
    "pS",
    grep("ppm",colnames(df_fit), value = TRUE)
  )

outcome_labels <-
  c(
    "Total C",
    "Total C 30cm",
    "POM C",
    "MAOM C",
    "Total C Stock",
    "Total C Stock 30cm",
    "POM C Stock",
    "MAOM C Stock",
    "MAOM C/POM C",
    "Soil C/N",
    "Bulk Density",
    "Aboveground Growth",
    "Height",
    "Root C/N",
    "Root Lignin",
    "Soil pH",
    "Root Calcium",
    "Root Magnesium",
    "Root Potassium",
    "Root Phosphorus",
    "Root Sulfur",
    "Root Aluminum",
    "Root Boron",
    "Root Chromium",
    "Root Copper",
    "Root Iron",
    "Root Manganese",
    "Root Sodium",
    "Root Nickel",
    "Root Zinc"
  )

outcome_pools <- 
  c(
    "Soil",
    "Soil",
    "Soil",
    "Soil",
    "Soil",
    "Soil",
    "Soil",
    "Soil",
    "Soil",
    "Soil",
    "Soil",
    "Aboveground\n Biomass",
    "Aboveground\n Biomass",
    "Root",
    "Root",
    "Soil",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root"
    )

var_key <- tibble(phenotype = outcome, 
                  label = outcome_labels, 
                  pool = outcome_pools)

# Helper function for calculating broad sense heritability
calc_H2 <- function(df, y, fma) {
  fma <- as.formula(paste0(y, fma))
  mdl <- lme4::lmer(fma, data = df)
  mdl_sds <-
    tidy(mdl) |>
    filter(group %in% c("genotype","Residual")) |>
    pluck("estimate")
  genotype_var <- mdl_sds[1]^2
  #error_sd <- mdl_sds[2]
  error_var <- var(resid(mdl))
  H2 <- genotype_var / (genotype_var + error_var)
  warn <- mdl@optinfo$conv$lme4$messages
  # Check normality and variance explained
  p_nrm <- shapiro.test(resid(mdl))$p.value
  var_exp <- 1 - error_var / var(df |> pluck(y))
  var_rem <-  (error_var + genotype_var) / var(df |> pluck(y))
  p_htc <- check_heteroscedasticity(mdl) |> as.double()
  gt_eff <- modelbased::estimate_grouplevel(mdl)
  
  return(list(H2 = H2, 
              warn = warn,
              p_nrm = p_nrm,
              var_exp = var_exp,
              var_rem = var_rem,
              p_htc = p_htc,
              gt_eff = gt_eff))
  
}

# Formulas and data for mixed effect model
fma_full <- "~ Silt + Clay + cop_id + block + (1|genotype)"
fma_min <- "~ (1|genotype)"
fmas <- list(full = fma_full, min = fma_min)
dfs <- list(raw = df_fit, sa_reml = df_fit_sa_reml, sa_gcv = df_fit_sa_gcv)
  
# y = "Alppm"
# fma <- as.formula(paste0(y, fma_full))
# 
# mdl <- lme4::lmer(fma, data = df_fit)
# check_model(mdl)
# test <- rstanarm::stan_glmer(fma, data = df_fit, family = gaussian())
# check_model(test)
# test <- modelbased::estimate_grouplevel(mdl)
# 
#   mdl_sds <-  tidy(mdl) |>
#     filter(group %in% c("genotype","Residual")) |>
#     pluck("estimate")

# Calculate H2
H2_results <-
  expand_grid(outcome, fmas, dfs) |>
  mutate(
    model = names(fmas),
    data = names(dfs),
    run = paste(outcome, model, data, sep = "-"),
    across(model:run, fct)
  ) |>
  mutate(fits = pmap(list(dfs, outcome, fmas),
                     \(x, y, z) calc_H2(x, y, z))) |>
  unnest_wider(fits) |>
  select(outcome, model, data, run, H2:gt_eff) |> 
  rename(phenotype = outcome)

Figure 9 illustrates the sensitivity of \(H^2\) to our mixed effects models and spatial data correction. As expected, the raw data set tends to have lower \(H^2\) than the spatially corrected sets, since the correction removes variability from the data and makes the denominator smaller. Additionally, the full mixed effects model appears to strengthen \(H^2\) (solid versus dashed border) in most cases (POM_st is a counter-example). This could indicate that controlling for block, soil texture, and coppicing lets the model attribute more variance to genotype. Alternatively, the flexibility allowed by the extra parameters could permit over-fitting. We did not perform cross-validation, as accounting for missing factors can be tedious; however, that may be a good future step.
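A simple k-fold cross-validation for these mixed models could be sketched as below (hypothetical `cv_rmse` helper, not run in this analysis); `allow.new.levels = TRUE` lets held-out genotypes fall back to the population-level prediction:

```r
# Hedged sketch of k-fold CV for the heritability models. Assumes
# complete cases in df and a formula such as
# TotC ~ Silt + Clay + cop_id + block + (1|genotype).
cv_rmse <- function(df, fma, k = 10) {
  folds <- sample(rep_len(seq_len(k), nrow(df)))
  sq_err <- lapply(seq_len(k), function(i) {
    fit  <- lme4::lmer(fma, data = df[folds != i, ])
    hold <- df[folds == i, ]
    pred <- predict(fit, newdata = hold, allow.new.levels = TRUE)
    (hold[[all.vars(fma)[1]]] - pred)^2
  })
  sqrt(mean(unlist(sq_err)))  # out-of-bag RMSE
}
```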

Code
g1 <- H2_results |>
  mutate(name = fct_reorder(phenotype, H2)) |>
  pivot_longer(cols = c(H2,var_exp, var_rem, p_nrm),
               names_to = "stat") |>
  mutate(
    stat = fct_recode(
      fct(stat, levels = c("H2","var_exp", "var_rem", "p_nrm")),
      `Heritability, H2` = "H2",
      `Variance Explained` = "var_exp",
      `Variance Remaining` = "var_rem",
      `Norm. p-value` = "p_nrm"
    )
  ) |>
  ggplot(aes(y = name,
             x = value,
             fill = data,
             linetype = model)) +
  geom_col(position = "dodge", color = "black") +
  geom_vline(
    data = tibble(stat = fct(c("Norm. p-value")),
                  thresh = c(0.05)),
    aes(xintercept = thresh),
    colour = "red",
    linetype = "dashed"
  ) +
  coord_cartesian(xlim = c(0, 1), clip = "off") +
  xlab("Statistic Value") +
  ylab("phenotype") +
  facet_wrap(~ stat, nrow = 1) +
    force_panelsizes(rows = unit(5.75, "in"),
                   cols = unit(1.7, "in")
                   ) +
  theme_cowplot(fs) +
  theme(legend.position = "bottom", legend.justification = "center") 

g1 <- tag_facet2(g1, open = "", vjust = -2, hjust = 1)
g1
rm(g1)  

Figure 9: Linear mixed model performance using raw and spatially-adjusted datasets and differing model formulations. We look at the a) heritability, b) variance explained by the model, c) variance remaining to be explained by genotype, and d) the p-value from a normality test of model residuals.

3.2 Selected Heritability Results

Figure 10 and Figure 11 show the \(H^2\) results for the full model fit to the raw data and the minimal model fit to the spatially corrected data using the REML criterion. The results show no major differences between the two, except for the most important phenotype, MAOM C---the spatial correction nearly doubles its \(H^2\) value. I trust the REML results more than the GCV correction, due to the smaller effective degrees of freedom. The lack of difference between the raw and spatially adjusted results calls into question the correction's use in this analysis.

Code
H2_results |>
  filter(data %in% "raw", model %in% "full") |>
  left_join(var_key, by = "phenotype") |>
  ggplot(aes(
    y = fct_reorder(label, H2),
    x = H2,
    fill = pool
  )) +
  geom_col(position = "dodge", color = "black", linewidth = 0.2) +
  geom_vline(xintercept = 0.1,
             color = "red",
             linetype = "dashed") +
  coord_cartesian(xlim = c(0, 1)) +
  xlab(expression(Broad ~ Sense ~ Heritability ~ H ^ 2)) +
  ylab("Phenotype") +
  theme_cowplot(fs) +
  theme(legend.position = c(0.6, 0.2)) +
  guides(fill = guide_legend(title = "C Pool"))

fig_path <- "./04-outputs/02-figures/03-manuscript"
# ggsave2(file.path(fig_path, "fig-5.jpg"),
#         width = 5, height = 5)

Figure 10: Broad sense heritability results for the raw data set.
Code
H2_results |>
  filter(data %in% "sa_reml", model %in% "min") |>
  left_join(var_key, by = "phenotype") |>
  ggplot(aes(
    y = fct_reorder(label, H2),
    x = H2,
    fill = pool
  )) +
  geom_col(position = "dodge", color = "black", linewidth = 0.2) +
  geom_vline(xintercept = 0.1,
             color = "red",
             linetype = "dashed") +
  coord_cartesian(xlim = c(0, 1)) +
  xlab(expression(Broad ~ Sense ~ Heritability ~ H ^ 2)) +
  ylab("Phenotype") +
  theme_cowplot(fs) +
  theme(legend.position = c(0.6, 0.2)) +
  guides(fill = guide_legend(title = "C Pool"))

Figure 11: Broad sense heritability results for the TPS-REML data set.

3.3 Questions for Mirko and Hari

  1. Do you often adjust the underlying assumptions in lmer or glmer? Figure 9 shows that many of the mixed-effects models have non-normal residuals. If I take a closer look at the genotype-level averages for the raw and corrected data sets (Figure 12), they are also non-normal. Do you typically use glmer and select a Gamma family for the distribution rather than Gaussian? I have seen this before in Blumstein et al. (2022). They agreed that using a Gamma or other distribution seems appropriate. Mirko said he has done it before with a beta function.

  2. Do you qualify your heritability results by the amount of variability in the dataset, especially if the spatial correction removes a large chunk of the variability? Not really. However, Rashon did mention one method whereby you normalize the phenotype before TPS correction, fit the mixed effect model, and then normalize the residuals after. He said this may keep the denominator from influencing the calculation. However, I am not quite sure; I need to think about this option more and possibly ask follow-up questions.

  3. How do you quantify uncertainty in your heritability estimates? Rashon mentioned that he used the vpred package with a mixed effect model to do the linear uncertainty approximation of heritability. He said the standard deviation reported by lmer is not actually the estimated error, but rather the square root of the given variance. I could also take the Bayesian approach using rstanarm::stan_lmer.

  4. Do you ever use other covariates in your heritability analysis? Here we have used soil texture, coppicing, and block in models without spatial correction. Yes, Hari has often used other covariates. I cited an example of trying to explain pre-dawn water potential and needing time of day.
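The Gamma alternative raised in question 1 could look like the following (hypothetical call, not used in the analysis above):

```r
# Hedged sketch: glmer with a Gamma family and log link keeps positive,
# right-skewed phenotypes (e.g., MAOM C) on their natural scale.
library(lme4)

mdl_gamma <- glmer(MAOMC ~ Silt + Clay + cop_id + block + (1 | genotype),
                   data = df_fit, family = Gamma(link = "log"))

# Caveat: the variance components now live on the link scale, so the H2
# calculation would need a distribution-specific residual variance rather
# than var(resid(mdl)) as in calc_H2() above.
```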

Code
dfs <- list(raw = df_fit, sa_reml = df_fit_sa_reml, sa_gcv = df_fit_sa_gcv)

calc_gt_avg <- function(df, cols) {
  df |>
    group_by(genotype) |>
    summarize(across(all_of(cols), \(x) mean(x, na.rm = TRUE))) |>
    ungroup() |>
    pivot_longer(all_of(cols)) |>
    group_by(name) |>
    mutate(
      norm_test = shapiro.test(value)$p.value,
      sig = ifelse(norm_test < 0.05, "Problem", "OK")
    ) |>
    ungroup()
}


df_gt_avg <- enframe(dfs, name = "method") |> 
  mutate(gt_avg = map(value, \(x) calc_gt_avg(x, outcome))) |> 
  select(-value) |> 
  unnest(gt_avg)

df_gt_avg |>
  #filter(name %in% c("MAOMC", "Clay", "agb", "Alppm")) |>
  group_by(name, method) |>
  mutate(
    adj_scale = scale(value, scale = TRUE),
    sig = ifelse(norm_test < 0.05, "Problem", "OK"),
    vjust = case_when(
      method == "raw" ~ -1,
      method == "sa_reml" ~ 0,
      method == "sa_gcv" ~ 1)
  ) |>
  ungroup() |>
  ggplot(aes(x = adj_scale, 
             colour = method
             )) +
  geom_density() +
  stat_function(
    fun = dnorm,
    n = 101,
    args = list(mean = 0, sd = 1),
    color = "black",
    linetype = "dashed"
  ) +
  geom_text(aes(
    label = sig,
    color = method,
    vjust = vjust
  ),
  x = -1.4,
  y = 0.75) +
  xlab("Normalized Values") +
  ylab("Kernel Density") +
  facet_wrap(~ name) +
  theme_cowplot(fs)

Figure 12: The normalized distribution of genotype means for the raw and spatially adjusted data sets. The black dashed line represents a standard normal distribution. The label ‘Problem’ indicates that the distribution failed the Shapiro-Wilk test of normality.

4 Site and Phenotypic Controls on Soil Carbon

4.1 Regularize the Regression with GLMNET

We next want to assess whether the site characteristics and observed phenotypes can explain the differences in soil carbon at Clatskanie. Here, we use the raw and two spatially corrected data sets and test two sets of predictors: 1) all site, soil texture and chemistry, aboveground growth, and root chemistry predictors (23 predictors, labelled all in figures), and 2) a subset excluding root metals (14 predictors, labelled no_metal in figures). We regress these predictors on Total, MAOM, and POM concentrations and stocks.

The large number of correlated predictors could result in model over-fitting and highly variable effects. To combat this, we perform regularized regression and cross-validate the results to get a better estimate of the uncertainty in effects and predictive performance. We perform both ridge and LASSO regression using the R package glmnet within the tidymodels ecosystem. The two regularized regressions are end-members of elastic-net regression, which penalizes the parameter norms, shrinking parameter values (some to 0 in the case of LASSO); this biases the fit to the training sample while improving the model's ability to fit out-of-bag samples (Sohil, Sohali, and Shabbir 2022). The regularized regression requires tuning of the penalty or regularization parameter, which we do via 10-fold cross-validation repeated 5 times. We select the penalty parameter for each regression that minimizes the root mean square error (RMSE).
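The ridge/LASSO end-members can be seen in glmnet directly through its alpha mixture argument (illustrative sketch on simulated data, separate from the tidymodels workflow used for the analysis):

```r
# Hedged sketch: alpha = 0 gives ridge (shrinks all coefficients toward 0),
# alpha = 1 gives LASSO (sets weak coefficients exactly to 0); cv.glmnet
# tunes the penalty lambda by k-fold cross-validation.
library(glmnet)

set.seed(1)
n <- 66; p <- 23                         # mirrors our sample/predictor counts
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)    # only 2 truly active predictors

ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 10)
lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

sum(coef(ridge, s = "lambda.min") != 0)  # ridge keeps every term
sum(coef(lasso, s = "lambda.min") != 0)  # LASSO retains a sparse subset
```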

Code
# Change to sum contrast coding - difference from grand mean
options(contrasts = rep("contr.sum", 2))

# Splits for cross-validation
set.seed(1)
v_folds <- 10
v_rep <- 5
make_df_cv <- \(x) vfold_cv(x,
                            v = v_folds,
                            repeats = v_rep)

df_cv <- list(
  raw = make_df_cv(df_fit),
  sa_reml = make_df_cv(df_fit_sa_reml),
  sa_gcv = make_df_cv(df_fit_sa_gcv)
)

# Predictors
predictors_sub <-
  c(
    "agb",
    "height",
    "root_CN",
    "Lig",
    "pK",
    "pCa",
    "pMg",
    "pP",
    "pS",
    "SiCl",
#    "Silt",
    "pH",
#    "block",
    "cop_id"
  )

predictors_all <-
  c(predictors_sub,
    grep("ppm", colnames(df_fit), value = TRUE))



# Create multiple recipes with differing outcomes
update_outcome <- function(rec, old_oc, new_oc) {
  new_rec <- rec |>
    update_role(all_of(old_oc), new_role = "ID") |>
    update_role(all_of(new_oc), new_role = "outcome")
}

outcomes <- c("TotC","MAOMC","POMC", "TotC_st","MAOMC_st","POMC_st")


# Create a workflow for the linear model
base_rcp <- 
  recipe(TotC ~ ., data = df_fit) |> 
  update_role(-all_of(c(predictors_all,"TotC")), new_role = "ID") |> 
  step_dummy(all_nominal_predictors()) |> 
  step_normalize(all_predictors()) |> 
  step_zv(all_predictors())

no_metal_rcp <- base_rcp |> 
  update_role(-all_of(c(predictors_sub,"TotC")), new_role = "ID") 

# Create recipe for each outcome
rcps <-
  expand_grid(outcome = outcomes,
              rcp_type = list(all = base_rcp, no_metal = no_metal_rcp)) |>
  mutate(
    rcps = map2(outcome, rcp_type, \(x, y) update_outcome(y, "TotC", x)),
    rcp_id = names(rcp_type)
  ) |>
  select(-rcp_type)

# GLMNET model spec
gn_specs = list(
  ridge = linear_reg(penalty = tune(), mixture = 0) |>
    set_engine("glmnet") |>
    set_mode("regression"),
  lasso =
    linear_reg(penalty = tune(), mixture = 1) |>
    set_engine("glmnet") |>
    set_mode("regression")
)

# Grid of GLMNET penalty
gn_grid <- grid_regular(penalty(range = c(-2, 2)),
                        #mixture(range = c(0, 1)),
                        levels = 50)

# Set up parallel cores
all_cores <- parallel::detectCores(logical = TRUE)
cl <- makePSOCKcluster(all_cores-1)
registerDoParallel(cl)

# Workflows
reg_results <- expand_grid(rcps, gn_specs, df_cv) |>
  mutate(
    spec_id = names(gn_specs),
    data_id = names(df_cv),
    wfs = map2(rcps, gn_specs, \(x, y) workflow(
      preprocessor = x, spec = y
    )),
    run = paste(outcome, data_id, spec_id, sep = "-"),
    across(where(is.character), fct),
    tuned_wfs = map2(wfs, df_cv,
                     \(x, y) tune_grid(x, resamples = y, grid = gn_grid))
  ) |> 
  select(-df_cv)

# Unnest regularization curves   
reg_curves <- reg_results |>
  select(where(is.factor),tuned_wfs) |>
  mutate(
    all_perf = map(tuned_wfs, \(x) collect_metrics(x)),
    best_perf = map(
      tuned_wfs,
      \(x) select_best(x, metric = "rmse", penalty)
    )
  ) |>
  select(-tuned_wfs)

# Create final fits
fits_final <- reg_results |>
  left_join(enframe(dfs, name = "data_id", value = "df"), by = "data_id") |>
  mutate(
    wfs_final = map2(wfs,
                     tuned_wfs,
                     \(x, y) finalize_workflow(
                       x,
                       select_best(y,
                                          metric = "rmse",
                                          penalty)
                     )),
    fit_final = future_map2(wfs_final, df, \(x, y) fit(x, data = y)),
    prms = future_map(fit_final, \(x) tidy(x))
  ) |> 
  select(outcome, rcp_id, spec_id, data_id, run, wfs_final, fit_final, prms)
 
rm(reg_results)    

Figure 13 shows the RMSE and \(R^2\) as a function of the penalty parameter (i.e., regularization curves) from cross-validation for each soil carbon outcome. Ideally, the RMSE (\(R^2\)) curves would decrease (increase) with a greater penalty (i.e., more regularization) as we reduce variance (i.e., over-fitting errors on out-of-bag samples), but then increase (decrease) as model bias (i.e., the model's inability to describe the data) grows. We focus on RMSE, as \(R^2\) is known to be positively biased for small samples (Cramer 1987). Here are the key takeaways:

  • Only the raw dataset (not spatially corrected) appears to provide a decent predictive model. The monotonically decreasing curve with penalty parameter (until it becomes horizontal) for POM and for sa_reml and sa_gcv indicates that performance is best when all parameters are shrunk to 0---that is, the mean or an intercept model is more predictive than our model. This is further highlighted by these curves having all values at or above 1, the normalized RMSE of an intercept model (RMSE here is normalized by the standard deviation of the data). The fact that the raw red curves for MAOM and Total C lie below 1 indicates that our model explains some of the variance in the out-of-bag samples.

  • The LASSO (triangles) results tend to give the same or better performance (lower RMSE) than the ridge regression (circles). This is a positive, as the glmnet algorithm for LASSO performs feature selection by forcing weaker parameters to 0.

  • The \(R^2\) regularization curves are more variable and erratic, likely due to the poor efficiency of the metric with small samples (Cramer 1987).

  • The chosen penalty parameter for each regression is shown by the vertical line.
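The "at or above 1" threshold used above follows directly from the normalization: dividing out-of-bag RMSE by the standard deviation of the response makes an intercept-only model score roughly 1, so a useful model must fall below that line (a quick illustrative check):

```r
# Illustrative check of the normalized-RMSE baseline on simulated data.
set.seed(1)
y <- rnorm(100, mean = 10, sd = 3)
rmse_intercept <- sqrt(mean((y - mean(y))^2))  # RMSE of predicting the mean
rmse_intercept / sd(y)  # ~1 (exactly sqrt((n - 1) / n) in-sample)
```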

Code
df_cv_stats <- bind_rows(df_cv, .id = "data_id") |>
  mutate(sd = map(splits, \(x) assessment(x) |>
                    summarize(across(
                      .cols = all_of(outcomes), \(x) sd(x, na.rm = TRUE)
                    )))) |>
  unnest(sd) |>
  pivot_longer(cols = all_of(outcomes), names_to = "outcome") |>
  group_by(data_id, outcome) |>
  summarize(median_sd = median(value, na.rm = TRUE)) |> 
  ungroup()
 
# Collect regularization curves
reg_curves |>
  filter(rcp_id %in% "all", outcome %in% c("TotC","MAOMC", "POMC")) |>
  unnest(all_perf) |>
  left_join(df_cv_stats, by = c("data_id", "outcome")) |>
  mutate(
    mean = ifelse(.metric  %in% "rmse", mean / median_sd, mean),
    std_err = ifelse(.metric  %in% "rmse", std_err / median_sd, std_err)
  ) |>
  ggplot(aes(
    x = log10(penalty),
    y = mean,
    color = data_id,
    linetype = spec_id,
    shape = spec_id
  )) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err),
                width =
                  .2) +
  #geom_line(size = 1, linetype = "dashed") +
  geom_point(size = 2) +
  # geom_point(data = reg_curves |> unnest(best_perf)) +
  geom_vline(
    data = reg_curves |> filter(rcp_id %in% "all", outcome %in% c("TotC","MAOMC", "POMC")) |> unnest(best_perf),
    aes(
      xintercept = log10(penalty),
      color = data_id,
      linetype = spec_id,
    ),
    linewidth = 0.5
  ) +
  scale_y_continuous("Cross-validation Statistic") +
  scale_x_continuous("Penalty Parameter") +
  facet_grid2(`.metric` ~ outcome, scales = "free", independent = "none") +
  theme_cowplot(fs) + 
  theme(legend.position = "bottom", legend.justification = "center")

Figure 13: The regularization curves from 10-fold cross-validation used to specify the penalty value for the Ridge and LASSO regressions for soil C concentrations.

Figure 14 shows the regularization curves for the soil carbon stocks. Interestingly, the spatially corrected regressions do show an optimum performance (i.e., minimum RMSE). However, their values are all above 1, which indicates that the model is explaining little if any of the variability in the out-of-bag samples. In other words, a certain penalty improves performance, but that model does not outperform the mean as a predictor.

Code
df_cv_stats <- bind_rows(df_cv, .id = "data_id") |>
  mutate(sd = map(splits, \(x) assessment(x) |>
                    summarize(across(
                      .cols = all_of(outcomes), \(x) sd(x, na.rm = TRUE)
                    )))) |>
  unnest(sd) |>
  pivot_longer(cols = all_of(outcomes), names_to = "outcome") |>
  group_by(data_id, outcome) |>
  summarize(median_sd = median(value, na.rm = TRUE)) |> 
  ungroup()
 
# Collect regularization curves
reg_curves |>
  filter(rcp_id %in% "all", outcome %in% c("TotC_st","MAOMC_st", "POMC_st")) |>
  unnest(all_perf) |>
  left_join(df_cv_stats, by = c("data_id", "outcome")) |>
  mutate(
    mean = ifelse(.metric  %in% "rmse", mean / median_sd, mean),
    std_err = ifelse(.metric  %in% "rmse", std_err / median_sd, std_err)
  ) |>
  ggplot(aes(
    x = log10(penalty),
    y = mean,
    color = data_id,
    linetype = spec_id,
    shape = spec_id
  )) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err),
                width =
                  .2) +
  #geom_line(size = 1, linetype = "dashed") +
  geom_point(size = 2) +
  # geom_point(data = reg_curves |> unnest(best_perf)) +
  geom_vline(
    data = reg_curves |> filter(rcp_id %in% "all", outcome %in% c("TotC_st","MAOMC_st", "POMC_st")) |> unnest(best_perf),
    aes(
      xintercept = log10(penalty),
      color = data_id,
      linetype = spec_id,
    ),
    linewidth = 0.5
  ) +
  scale_y_continuous("Cross-validation Statistic") +
  scale_x_continuous("Penalty Parameter") +
  facet_grid2(`.metric` ~ outcome, scales = "free", independent = "none") +
  theme_cowplot(fs) + 
  theme(legend.position = "bottom", legend.justification = "center")

Figure 14: The regularization curves from 10-fold cross-validation used to specify the penalty value for the Ridge and LASSO regressions for soil C stocks.

Lastly, we take a closer look at the MAOM regularization curves with respect to the full (all) and subset (no_metals) of predictors. We see that the only useful model is with the raw dataset. Using all predictors appears to offer a slight benefit for performance, as does the LASSO method.

Code
# Calculate outcome standard deviations for normalization
# dfs_stats <- bind_rows(dfs, .id = "data") |>
#   pivot_longer(cols = all_of(outcomes), names_to = "outcome") |>
#   group_by(data, outcome) |>
#   summarize(sd = sd(value, na.rm = TRUE))

df_cv_stats <- bind_rows(df_cv, .id = "data_id") |>
  mutate(sd = map(splits, \(x) assessment(x) |>
                    summarize(across(
                      .cols = all_of(outcomes), \(x) sd(x, na.rm = TRUE)
                    )))) |>
  unnest(sd) |>
  pivot_longer(cols = all_of(outcomes), names_to = "outcome") |>
  group_by(data_id, outcome) |>
  summarize(median_sd = median(value, na.rm = TRUE)) |> 
  ungroup()
 
# Collect regularization curves
reg_curves |>
  unnest(all_perf) |>
  filter(outcome %in% "MAOMC", .metric %in% "rmse") |>
  left_join(df_cv_stats, by = c("data_id", "outcome")) |>
  mutate(
    mean = ifelse(.metric  %in% "rmse", mean / median_sd, mean),
    std_err = ifelse(.metric  %in% "rmse", std_err / median_sd, std_err)
  ) |>
  ggplot(aes(
    x = log10(penalty),
    y = mean,
    color = spec_id,
    linetype = rcp_id,
    shape = rcp_id
  )) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err),
                width =
                  .2) +
  #geom_line(size = 1, linetype = "dashed") +
  geom_point(size = 2) +
  # geom_point(data = reg_curves |> unnest(best_perf)) +
  geom_vline(
    data = reg_curves |> filter(outcome %in% "MAOMC") |> unnest(best_perf),
    aes(
      xintercept = log10(penalty),
      color = spec_id,
      linetype = rcp_id,
    ),
    linewidth = 0.5
  ) +
  scale_y_continuous("Cross-validation Statistic") +
  scale_x_continuous("Penalty Parameter") +
  facet_grid2(`.metric` ~ data_id, scales = "free", independent = "none") +
  theme_cowplot(fs) + 
  theme(legend.position = "bottom", legend.justification = "center")

Figure 15: The regularization curves from 10-fold cross-validation used to specify the penalty value for the Ridge and LASSO regressions.

4.2 Extract Effects from Regularized Regression

4.2.1 Cross-validate Effects

Rather than simply fitting the finalized regularized regression models to the full data set to get a point estimate of effects, we would like a better estimate of the sampling distribution of the effects and of the model's predictive performance. To do this, we repeat the 10-fold cross-validation 20 times using the tuned penalty parameter, which yields 200 estimates of the predictor effects and 200 estimates of out-of-bag performance. Note, this is very similar to bootstrapping, except we guarantee the model is trained on 90% of the data rather than the ~63% of unique points in an average bootstrap resample. This is critical in our analysis, as we only have 66 data points.
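The ~63% figure for bootstrapping can be checked with a quick back-of-envelope simulation (illustrative only):

```r
# Average fraction of unique points in a bootstrap resample of n = 66,
# versus the guaranteed training fraction of 10-fold cross-validation.
set.seed(1)
n <- 66
boot_frac <- mean(replicate(5000,
  length(unique(sample(n, replace = TRUE))))) / n
boot_frac    # ~0.63, approaching 1 - exp(-1) = 0.632... for large n
9 / 10       # each 10-fold split trains on 90% of the data
```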

Code
predictors_plt <-
  c(
    "agb",
    "height",
    "cop_id_Non.coppiced",
    "root_CN",
    "Lig",
    "pCa",
    "pMg",
    "pK",
    "pP",
    "pS",
    grep("ppm",colnames(df_fit), value = TRUE),
    "SiCl",
#    "Silt",
    "pH"
  )

outcome_labels <-
  c(
    "Aboveground Growth",
    "Height",
    "Coppicing",
    "Root C/N",
    "Root Lignin",
    "Root Calcium",
    "Root Magnesium",
    "Root Potassium",
    "Root Phosphorus",
    "Root Sulfur",
    "Root Aluminum",
    "Root Boron",
    "Root Chromium",
    "Root Copper",
    "Root Iron",
    "Root Manganese",
    "Root Sodium",
    "Root Nickel",
    "Root Zinc",
    "Silt + Clay",
#    "Silt",
    "Soil pH"
  )

outcome_pools <-
  c(
    "Aboveground \n Biomass",
    "Aboveground \n Biomass",
    "Aboveground \n Biomass",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Root",
    "Soil",
#    "Soil",
    "Soil"
  )

var_key <- tibble(term = predictors_plt, 
                  label = outcome_labels, 
                  pool = outcome_pools)
Code
# Add more resamples
v_folds <- 10
v_rep <- 20
make_df_cv <- \(x) vfold_cv(x,
                            v = v_folds,
                            repeats = v_rep)

df_cv <- enframe(list(
  raw = make_df_cv(df_fit),
  sa_reml = make_df_cv(df_fit_sa_reml),
  sa_gcv = make_df_cv(df_fit_sa_gcv)
), name = "data_id", value = "df_cv")

# Create new cv sets
tidy_ctrl <- control_grid(extract = \(x) tidy(x))

cv_fits <- fits_final |>
  left_join(df_cv, by = "data_id") |> 
  select(c(df_cv,outcome:run, wfs_final)) |>
  mutate(
    cv_fit = map2(
      wfs_final,
      df_cv,
      \(x, y) fit_resamples(x, resamples = y, control = tidy_ctrl)
    ),
    metrics = map(cv_fit, \(x) collect_metrics(x, summarize = FALSE)),
    effects = map(cv_fit, collect_extracts)
  ) |> 
  select(-df_cv, -cv_fit)

We first take a look at the effects distribution for soil carbon concentrations, comparing the full and reduced sets of predictors. Figure 16 illustrates that both sets of predictors give similar signs yet varying magnitudes and spreads of effects. In particular, including all predictors appears to increase the magnitude and spread of the root Mg effect, while 4 of the 9 root metals appear to have a non-negligible effect. Note, POM shows all effects centered around zero, as the regularization simply produces an intercept model.

Code
cv_fits |>
  select(-metrics) |>
  unnest(effects) |>
  unnest(.extracts) |> 
  filter(term %in% predictors_plt,
         outcome %in% c("TotC", "MAOMC", "POMC"),
         data_id %in% "raw",
         spec_id %in% "lasso") |>
  mutate(phenotype  = fct_recode(
    outcome,
    MAOM = "MAOMC",
    POM = "POMC",
    Total = "TotC"
  )) |> 
  left_join(var_key, by = "term") |> 
  mutate(label = fct(label, levels = rev(var_key$label))) |> 
  ggplot(aes(
    y = label,
    x = estimate,
    fill = pool,
    alpha = rcp_id,
    linetype = rcp_id
  )) +
  ggridges::geom_density_ridges(scale = 1) +
  geom_vline(xintercept = 0) +
  #geom_col(position = "dodge", color = "black") +
  xlab("Soil C Effect [mg C/g soil/SD]") +
  ylab("Predictor") +
  facet_wrap(~phenotype, scales = "free_x") +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

Figure 16: Site and phenotypic effects on soil C concentration for differing predictors using the LASSO method and raw data.

Overall, the soil C stock effects (Figure 17) appear to agree with the previous concentration results.

Code
cv_fits |>
  select(-metrics) |>
  unnest(effects) |>
  unnest(.extracts) |> 
  filter(term %in% predictors_plt,
         outcome %in% c("TotC_st", "MAOMC_st", "POMC_st"),
         data_id %in% "raw",
         spec_id %in% "lasso") |>
  mutate(phenotype  = fct_recode(
    outcome,
    MAOM = "MAOMC_st",
    POM = "POMC_st",
    Total = "TotC_st"
  )) |> 
  left_join(var_key, by = "term") |> 
  mutate(label = fct(label, levels = rev(var_key$label))) |> 
  ggplot(aes(
    y = label,
    x = estimate,
    fill = pool,
    alpha = rcp_id,
    linetype = rcp_id
  )) +
  ggridges::geom_density_ridges(scale = 1) +
  geom_vline(xintercept = 0) +
  #geom_col(position = "dodge", color = "black") +
  xlab("Soil C Effect [tonnes C/ha/SD]") +
  ylab("Predictor") +
  facet_wrap(~phenotype, scales = "free_x") +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

Figure 17: Site and phenotypic effects on soil C stock for differing predictors using the LASSO method and raw data.

Finally, we look at the predictive performance distributions (Figure 18) to verify our final assumption set for the manuscript figures. We see that using all predictors appears to reduce RMSE, while there is minimal performance difference between LASSO and ridge regression.

Code
cv_fits |>
  select(-effects) |>
  unnest(metrics) |>
  left_join(df_cv_stats, by = c("data_id", "outcome")) |>
  mutate(.estimate = ifelse(.metric  %in% "rmse", .estimate / median_sd, .estimate)) |>
  filter(data_id %in% "raw") |> 
  ggplot(aes(
    y = outcome,
    x = .estimate,
    color = rcp_id,
    fill = rcp_id,
    linetype = spec_id
  )) +
  ggridges::geom_density_ridges(scale = 1, alpha = 0.4) +
  geom_vline(xintercept = 1) +
  scale_fill_manual(values = c("red",NA)) +
  xlab("Soil C Effect [mg C/g soil/SD]") +
  ylab("Predictor") +
  facet_wrap( ~ .metric, scales = "free_x") +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

Figure 18: Predictive performance of regularized regression models for differing soil outcomes and predictors.

4.2.2 Selected Effects

For the publication, we have decided to focus on MAOM concentration effects for all predictors using the raw data and showing both the ridge and LASSO regression results. Figure 19 is Figure 5 in the manuscript.

Code
cv_fits |>
  select(-metrics) |>
  unnest(effects) |>
  unnest(.extracts) |> 
  filter(term %in% predictors_plt,
         outcome %in% c("MAOMC"),
         data_id %in% "raw",
         rcp_id  %in% "all") |>
  mutate(outcome  = fct_recode(
    outcome,
    MAOM = "MAOMC",
    POM = "POMC",
    Total = "TotC"
  )) |> 
  left_join(var_key, by = "term") |> 
  mutate(label = fct(label, levels = rev(var_key$label))) |> 
  ggplot(aes(
    y = label,
    x = estimate,
    fill = pool,
    linetype = spec_id,
    alpha = spec_id
  )) +
  ggridges::geom_density_ridges(scale = 2) +
  geom_vline(xintercept = 0) +
  #geom_col(position = "dodge", color = "black") +
  xlab("Soil C Effect [mg C/g soil/SD]") +
  ylab("Predictor") +
  facet_wrap(~outcome, scales = "free_x") +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

fig_path <- "./04-outputs/02-figures/03-manuscript"
ggsave2(file.path(fig_path, "fig-6-dist.jpg"),
        width = 6, height = 7)

Figure 19: Site and phenotypic effects on MAOM C concentration for all predictors and both regularizations using the raw data.

For completeness, we show the Total and MAOM effects for both concentration and stock in Figure 20 and Figure 21. Total C closely mirrors the MAOM results, given that soil C at Clatskanie is mostly MAOM (Figure 1).

Code
cv_fits |>
  select(-metrics) |>
  unnest(effects) |>
  unnest(.extracts) |> 
  filter(term %in% predictors_plt,
         outcome %in% c("TotC", "MAOMC"),
         data_id %in% "raw",
         rcp_id  %in% "all") |>
  mutate(outcome  = fct_recode(
    outcome,
    MAOM = "MAOMC",
    POM = "POMC",
    Total = "TotC"
  )) |> 
  left_join(var_key, by = "term") |> 
  mutate(label = fct(label, levels = rev(var_key$label))) |> 
  ggplot(aes(
    y = label,
    x = estimate,
    fill = pool,
    linetype = spec_id,
    alpha = spec_id
  )) +
  ggridges::geom_density_ridges(scale = 2) +
  geom_vline(xintercept = 0) +
  #geom_col(position = "dodge", color = "black") +
  xlab("Soil C Effect [mg C/g soil/SD]") +
  ylab("Predictor") +
  facet_wrap(~outcome, scales = "free_x") +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")


Figure 20: Site and phenotypic effects on Total and MAOM C concentration for all predictors and regularizations using the raw data.
Code
cv_fits |>
  select(-metrics) |>
  unnest(effects) |>
  unnest(.extracts) |> 
  filter(term %in% predictors_plt,
         outcome %in% c("TotC_st", "MAOMC_st"),
         data_id %in% "raw",
         rcp_id  %in% "all") |>
  mutate(outcome  = fct_recode(
    outcome,
    MAOM = "MAOMC_st",
    Total = "TotC_st"
  )) |> 
  left_join(var_key, by = "term") |> 
  mutate(label = fct(label, levels = rev(var_key$label))) |> 
  ggplot(aes(
    y = label,
    x = estimate,
    fill = pool,
    linetype = spec_id,
    alpha = spec_id
  )) +
  ggridges::geom_density_ridges(scale = 2) +
  geom_vline(xintercept = 0) +
  #geom_col(position = "dodge", color = "black") +
  xlab("Soil C Effect [mg C/g soil/SD]") +
  ylab("Predictor") +
  facet_wrap(~outcome, scales = "free_x") +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

Figure 21: Site and phenotypic effects on Total and MAOM C stock for all predictors and regularizations using the raw data.

Finally, we zoom in on the performance of LASSO regression with all predictors for the relevant soil C concentrations and stocks (Figure 22). Most notably, predictive performance is worse for stocks than for concentrations.

Code
cv_fits |>
  select(-effects) |>
  unnest(metrics) |>
  left_join(df_cv_stats, by = c("data_id", "outcome")) |>
  mutate(.estimate = ifelse(.metric  %in% "rmse", .estimate / median_sd, .estimate)) |>
  filter(data_id %in% "raw",
         spec_id %in% "lasso",
         outcome %in% c("MAOMC","TotC","MAOMC_st","TotC_st")) |> 
  ggplot(aes(
    y = outcome,
    x = .estimate,
    fill = rcp_id,
    linetype = rcp_id
  )) +
  ggridges::geom_density_ridges(scale = 1, alpha = 0.4) +
  geom_vline(xintercept = 1) +
  xlab("Metric Value") +
  ylab("Outcome") +
  facet_wrap( ~ .metric, scales = "free_x") +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

Figure 22: Performance of LASSO models for soil C concentration and stock for differing sets of predictors.

4.3 Questions for Mirko and Hari

  1. The spatial correction removes much of the variability in MAOM and Total C concentrations and stocks (50-80% in Figure 6). The remaining variability is either all noise or heavily contaminated with it. This, coupled with the small sample size, likely contributes to the non-predictive models for the two spatially corrected data sets (Figure 13). This result emphasizes the importance of shoring up the spatial correction. Although spatial adjustment of predictors and outcomes seems to be common practice at ORNL (Evans et al. 2014; Chhetri et al. 2019; Kristy et al. 2022), the formal analysis of such spatial confounding is relatively young (Dupont, Wood, and Augustin 2022b). Have you ever assessed the sensitivity of these spatial corrections on a toy example, as in Dupont, Wood, and Augustin (2022a), or propagated uncertainty from the spatial model through to other results? We did not get to discuss this much, but I assume the answer is no. This could be a fruitful future collaboration: we could set up a toy data set, contaminate it with noise, and determine the influence on downstream analyses and regressions. I think Mirko and Rashon would be ideal to help with this.

  2. Overall, the regularized regression models for the raw data explain 30-40% of the variance on average. However, the spatial model is able to explain 70-80% of the variability in MAOM and Total C. Given the large number of predictors that vary spatially and are correlated, it is surprising that latitude and longitude are better predictors. One thing to consider is that the variance explained in Figure 6 is based on the full model, not on out-of-bag samples as in Figure 22. Do you ever do more rigorous cross-validation of the spatial model fits to get a better sense of whether over-fitting is occurring? Rashon briefly touched on this and thought that there may be a way to adjust fields to allow for more flexible cross-validation.

  3. How do you typically handle numerous correlated predictors and low sample sizes? Here, I have taken the approach of regularized regression. Hari has typically used variance inflation factors in vegan. Mirko had a good idea to use the xicor method and R package (Chatterjee and Holmes 2023; Chatterjee 2021), which provide a robust alternative to standard correlation coefficients. I could either use these to remove predictors or to better contextualize the important predictors.

  4. Have you ever employed a more Bayesian approach, such as rstanarm's stan_lmer, to better characterize sampling variability and parameter correlations? Figure 23 shows one such approach using Stan and looking at practical significance. The red region defines the region of practical equivalence (ROPE), and the blue region is the posterior density outside the ROPE. The more mass in blue, the more confident we are that a real effect exists. I did not ask this, but I am guessing they have not.

Code
library(tidybayes)
base_rec <-
  recipe(MAOMC ~ ., data = df_fit |> select(MAOMC, all_of(predictors_all))) |>
  step_normalize(all_numeric_predictors()) |>
  step_zv(all_predictors())

lmer_spec <- 
  linear_reg() %>% 
  set_engine("stan", prior = rstanarm::normal(0, 1, autoscale = FALSE))

wf_stan <- workflow() |> 
  add_recipe(base_rec) |> 
  add_model(lmer_spec)

fit_stan <- wf_stan |> 
  fit(data = df_fit)

preds <- fit_stan |> 
  extract_fit_engine() |> 
  get_variables() 

# I have to update the default ..y variable name back to MAOMC for plotting
plt_data <- fit_stan |>
  extract_fit_engine() 

plt_data$data = plt_data$data |> 
  rename(MAOMC = `..y`)

p_significance(plt_data, threshold = 0.1) |> plot(priors = FALSE) + 
  xlab("Soil C Effect [mg C/g soil/SD]") +
  ylab("Predictor") +
  theme_cowplot(fs)

#rope(test, range = c(-0.1,0.1)) |> plot(priors = TRUE) + theme_cowplot()

## Other Analysis and plots
# fit_stan |>
#   extract_fit_engine() |>
#   gather_draws(agb,
#                height,
#                root_CN,
#                Lig,
#                pK,
#                pCa,
#                pMg,
#                pP,
#                pS,
#                Clay,
#                Sand,
#                pH,
#                Alppm,
#                Bppm,
#                Crppm,
#                Cuppm,
#                Feppm,
#                Mnppm,
#                Nappm,
#                Nippm,
#                Znppm,
#                block1,
#                block2,
#                cop_id1) |>
#   ggplot(aes(
#     y = .variable,
#     x = .value,
#     fill = after_stat(-sign(median(x))*x < -sign(median(x))))
#   ) +
#   stat_halfeye()
# 
# test = rstanarm::stan_glm(MAOMC ~ agb + height + root_CN + Lig + pK + pCa + pMg + pP + pS + Clay + Sand + pH + block + cop_id,
#                    data = bake(prep(base_rec), new_data = NULL),
#                     prior = rstanarm::normal(0, 1, autoscale = FALSE))
# 
# lmer_fit <-
#   lmer_spec %>%
#   fit(MAOMC ~ agb + height + root_CN + Lig + pK + pCa + pMg + pP + pS + Clay + Sand + pH + block + cop_id,
#       data = df_fit)
# 
# lmer_fit |> 
#   extract_fit_engine() |> 
#   get_variables()
# 
# lmer_fit |> 
#   extract_fit_engine() |> 
#   spread_draws(Sigma[term,group]) |> 
#   ggplot(aes(y = group, x = Sigma)) +
#   stat_halfeye()

Figure 23: Site and phenotypic effects on MAOM C concentration for all predictors using the raw data and a Bayesian approach with Stan.
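
To make question 3 above concrete, here is a minimal base-R sketch of Chatterjee's xi coefficient (the no-ties formula underlying the XICOR package; the package itself adds tie handling and p-values):

```r
# Chatterjee's xi: rank y after sorting by x, then measure how
# smoothly the ranks vary. Near 1 for any functional dependence
# (even nonlinear), near 0 under independence.
xi_cor <- function(x, y) {
  n <- length(x)
  r <- rank(y[order(x)])                 # ranks of y in x-order
  1 - 3 * sum(abs(diff(r))) / (n^2 - 1)
}
set.seed(42)
x <- runif(500)
xi_cor(x, x^2)         # strong dependence -> close to 1
xi_cor(x, runif(500))  # independence -> close to 0
```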

5 Genotype Influence on C Sequestration

There are several estimates we can obtain for the variability in soil carbon stock due to genotype. The least conservative approach is to simply calculate genotype means from the raw and spatially corrected data. A more conservative approach is to look at the best linear unbiased predictors (BLUPs) from the heritability analysis. Figure 24 shows the soil C sequestration rates for both analyses and all data sets. The main takeaway is that the heritability analysis gives a much smaller range than the genotype averages, even compared to the spatially corrected datasets. I would venture that this is because we only have 3 replicates and the mixed-effects model is not very confident in the genotype effects, so it shrinks them toward the global mean and reduces the overall variance. I would argue we could treat the heritability estimates and simple averages as lower and upper bounds on the effects.
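
The shrinkage intuition can be made concrete with the standard mixed-model shrinkage factor, s = σ²g / (σ²g + σ²e/n); the variance components below are invented for illustration, not estimated from the Clatskanie data:

```r
# With few replicates and noisy data (se2 >> sg2), genotype BLUPs are
# pulled hard toward the grand mean; more replicates relax this.
shrink <- function(sg2, se2, n) sg2 / (sg2 + se2 / n)
shrink(sg2 = 1, se2 = 4, n = 3)   # ~0.43: effects cut by more than half
shrink(sg2 = 1, se2 = 4, n = 30)  # ~0.88: little shrinkage
```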

Code
# Combine genotype averages and mixed effects results
H2_update <- H2_results |> 
  unnest(gt_eff) |> 
  rename(
    method = data,
    name = phenotype,
    genotype = Level,
    value = Coefficient
  ) |> 
  select(name, model, method, genotype, value)

df_gt_avg_update <- df_gt_avg |> 
  mutate(model = "avg") |> 
  select(name, model, method, genotype, value) |> 
  group_by(method,name) |> 
  mutate(value = value - mean(value, na.rm = TRUE)) |> 
  ungroup()

df_gt_plt <- list(Heritability = H2_update, Average = df_gt_avg_update) |> 
  bind_rows(.id = "analysis")

# df_gt_plt |> 
#   filter(name %in% c("TotC_st"), method  %in% "raw") |> 
#   group_by(analysis, name, model, method) |> 
#   mutate(value = value/13) |> 
#   skimr::skim(value)
  
df_gt_plt |> 
  filter(name %in% c("TotC_st", "TotC_st_30", "MAOMC_st", "POMC_st")) |>
  ggplot(aes(x = value/13, y = method, color = model, linetype = analysis)) + 
  geom_boxplot() +
  xlab("Soil C Effect [tonnes C/ha/yr]") +
  ylab("Data Correction") +
  facet_wrap(~name, scale = "free_x") +
  theme_cowplot(fs)

Figure 24: Comparison of genotype soil C stock variability across spatial correction, C fraction, and analysis type.

For Figure 7 in the manuscript, I have focused on Total soil C from the raw data set at 0-15 cm depth, comparing the heritability and simple-average analyses (Figure 25).

Code
df_gt_plt |>
  filter(name %in% c("TotC_st"),
         model %in% c("full", "avg"),
         method %in% "raw") |>
  ggplot(aes(x = value / 13, y = analysis, color = analysis)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitterdodge()) +
  xlab("Genotype Effect on \n Total Soil C [tonnes C/ha/yr]") +
  ylab("Analysis") +
  theme_cowplot(fs) +
  theme(
    #axis.text.y = element_blank(),
    #axis.title.y = element_blank(),
    legend.position = "none"
  ) 

Figure 25: Genotype variability in 0-15 cm Total soil C stock for the raw data, comparing the heritability (BLUP) and simple-average analyses.

I have run the stock estimates for 0-30 cm depth (Figure 26); however, given the very poor heritability (Figure 10), there is no genotype variation. I still need to follow up on a few items:

  • Double-check the 0-30 cm stock calculations with John and figure out why the heritability is so poor. I would hypothesize that the compressed genotype variability is more indicative of a poor mixed-effects model fit and noisy data than of the actual genotype influence.

  • Extract the uncertainty in random effect standard deviation and explore normality of the genotype averages.

  • Create a synthetic data analysis where we have a spatial field of initial C that covaries with texture, and we superimpose all-positive genotype effects contaminated with noise. Can the spatial confounding and current models actually detect the average change in soil C and genotype effects at n = 3?

Code
df_gt_plt |>
  filter(name %in% c("TotC_st", "TotC_st_30"),
         model %in% c("full", "avg"),
         method %in% "raw") |>
  ggplot(aes(x = value / 13, y = analysis, color = name)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitterdodge()) +
  xlab("Genotype Effect on \n Total Soil C [tonnes C/ha/yr]") +
  ylab("Analysis") +
  theme_cowplot(fs) +
  theme(legend.position = "bottom",
        legend.justification = "center")

Figure 26: Genotype variability in Total soil C stock for the raw data at 0-15 cm (TotC_st) and 0-30 cm (TotC_st_30) depths.
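
A rough base-R starting point for the synthetic-data item above (all parameters invented): generate a smooth spatial C field that covaries with texture, superimpose small true genotype effects at n = 3, spatially residualize, and check how much genotype signal survives:

```r
set.seed(7)
n_gt <- 24; reps <- 3; n <- n_gt * reps
lat <- runif(n); lon <- runif(n)
clay <- 30 + 10 * lat + rnorm(n, 0, 2)        # clay tracks latitude
gt <- factor(rep(seq_len(n_gt), each = reps)) # 24 genotypes, 3 reps
gt_eff <- rnorm(n_gt, 0, 0.5)                 # small true genotype effects
soilC <- 20 + 5 * lat + 0.1 * clay + gt_eff[gt] + rnorm(n, 0, 1)

# Naive spatial "correction" of the outcome, then genotype means
resid_sp <- resid(lm(soilC ~ poly(lat, 2) + poly(lon, 2)))
est <- tapply(resid_sp, gt, mean)
cor(est, gt_eff)  # how much genotype signal survives the correction
```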

6 Supplement

Code
list(`Raw` = df_fit, `Adjusted` = df_fit_sa_reml) |>
  bind_rows(.id = "type") |>
  pivot_longer(cols = c(MAOMC, POMC, agb, height)) |>
  group_by(name, type) |>
  mutate(value = value - mean(value, na.rm = TRUE)) |>
  ungroup() |>
  group_by(genotype, name, type) |>
  summarize(value = mean(value, na.rm = TRUE)) |>
  ungroup() |>
  ggplot(aes(
    y = fct_rev(genotype),
    x = value,
    fill = genotype,
    linetype = type
  )) +
  geom_col(position = "dodge", color = "black") +
  xlab("Average Mean Deviation") +
  ylab("Genotype") +
  facet_wrap( ~ name, nrow = 1, scales = "free_x") +
  theme_cowplot(fs) +
  theme(legend.position = "bottom", legend.justification = "center") +
  guides(fill = "none")

I am searching for genotypic differences in total and mineral-associated organic matter in the selected poplar trees at Clatskanie. Figure 27 shows the deviation from the mean of Total and MAOM C, where a change in shading indicates a change in genotype, as there are ideally three replicates of each. Here, we focus only on the 0-15 cm soil core results, as they are the most complete data set and likely correspond best to the root chemistry and exudate results. Only two genotypes have a positive influence across all blocks and for both C variables: BESC-265 and BESC-131. The genotype BESC-841 has an overall negative effect for both C variables, while BESC-192 has an overall negative effect on MAOM. Note that the maximum mean deviations are on the order of 4%, so not particularly large.

Code
df_plt <- df_fit |>
  bind_rows(df_fit_sa_reml, .id = "df") |>
  mutate(df = ifelse(df == 1, "Raw", "Adj"))

df_plt |> 
  group_by(df) |> 
  mutate(
    across(.cols = c(TotC, MAOMC, POMC), \(x) x - mean(x))) |> 
  ungroup() |> 
  pivot_longer(cols = c(TotC, MAOMC, POMC)) |> 
  ggplot(aes(
    y = tree,
    x = value,
    fill = (as.numeric(genotype) %% 2 == 0),
    linetype = df,
    color = df
  )) + 
  geom_col(position = position_dodge()) +
  ylab("Tree") + xlab("0-15 cm C Deviation \n [tonnes C/ha soil]") +
  scale_fill_manual(values = c("FALSE" = "white","TRUE" = "gray")) +
  scale_color_manual(values = c("Raw" = "black", "Adj" = "red")) +
  theme_cowplot(fs) + 
  facet_wrap(~name, nrow = 1) +
  theme(legend.position = "none")

Figure 27: Deviation in Total, MAOM, and POM soil carbon in the top 0-15 cm of soil for 23 different poplar genotypes replicated across three blocks. Alternating shading separates the genotypes.

Next, we look at the change in soil carbon with various environmental and root drivers thought to be critical by biogeochemical models such as DayCent (parton1998?). DayCent assumes that total soil C increases with aboveground biomass (AGB), clay percentage, root lignin, and litter C/N. The reasoning behind each component is that increased AGB increases C inputs via litter, increased root lignin and litter C/N reduce decomposition efficiency and promote soil C accumulation, and increased clay provides surfaces for mineral association. However, there are feedbacks: decreased decomposition reduces N availability, and clay influences the soil moisture that drives decomposition, and thus plant productivity.
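
The qualitative logic can be illustrated with a toy one-pool steady state, C* = inputs / k (this is not DayCent itself; the coefficients are invented for illustration):

```r
# Toy one-pool soil C model: raising inputs (AGB/litter) or lowering
# the decay rate k (higher lignin or C/N, more clay protection) both
# raise equilibrium soil C. Coefficients are made up.
steady_C <- function(inputs, k_base, lignin, clay) {
  k <- k_base * (1 - 0.3 * lignin) * (1 - 0.2 * clay)  # slower decay
  inputs / k
}
steady_C(inputs = 2, k_base = 0.05, lignin = 0.2, clay = 0.3)
steady_C(inputs = 2, k_base = 0.05, lignin = 0.4, clay = 0.3)  # more lignin -> more C
```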

Code
df_plt |>
  group_by(df) |>
  mutate(across(.cols = c(MAOMC, Clay, root_CN, Lig, agb), \(x) x - mean(x))) |>
  ungroup() |>
  pivot_longer(cols = c(Clay, root_CN, Lig, agb)) |>
  ggplot(aes(
    x = value,
    y = MAOMC,
    color = df,
    fill = df,
    shape = df,
    linetype = df
  )) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, alpha = 0.1) +
  facet_wrap(
    ~ name,
    scales = "free_x",
    strip.position = "bottom",
    labeller = as_labeller(
      c(
        agb = "Aboveground Biomass \n Growth [tonnes C/ha]",
        root_CN = "Root C/N",
        Clay = " Soil Clay %",
        Lig = "Root Lignin %"
      )
    )
  ) +
  theme_cowplot(fs) +
  theme(strip.background = element_blank(),
        strip.placement = "outside") +
  guides(
    color = guide_legend(title = "Data"),
    fill = guide_legend(title = "Data"),
    shape = guide_legend(title = "Data"),
    linetype = guide_legend(title = "Data")
  )

Figure 28: MAOM C concentration relations with DayCent-relevant site and root measurements for the raw and spatially adjusted data.

There are numerous soil, root, and AGB variables at our disposal from the Clatskanie site for identifying genotypic differences and drivers of soil C accumulation. Figure 29 identifies significant correlations between potential predictor variables so as not to confound subsequent regression approaches. Of the variables, John was most suspicious of bulk density (BD) and moisture content (MCwetCHEM), as these two are most likely consequences of increased soil C rather than predictors of it. The strong relationships with Total and MAOM C support this suspicion, and we have decided to assess our statistical models with and without these two variables. The other troubling correlations are between root C/N and percent C and N (pC and pN), which is expected given they are calculated from each other. We therefore elect to keep root C/N, as it accounts for both. Lastly, coppiced trees are negatively correlated with AGB growth rate, which may seem counter-intuitive given that coppicing should increase productivity, though this may be due to the shorter growth period (10 years).

Code
# Spatially adjusted
# df_corr <-
#   df_fit |>
#   select(pH, Sand, Clay, pCa:Znppm)
# # png(
# #   height = 8,
# #   width = 8,
# #   res = 300,
# #   units = "in",
# #   file = file.path(fig_path, "corr-adj.png"),
# #   type = "cairo"
# # )
# 
# a <- cor(df_corr, use = "na.or.complete")
# testRes = cor.mtest(df_corr, conf.level = 0.95)
# corrplot::corrplot(a, p.mat = testRes$p, method = 'circle', type = 'lower',
#          addCoef.col ='black', number.cex = 0.7, order = 'original', diag=FALSE,
#          pch.cex = 0, insig = "pch", tl.col = "black", tl.cex = 0.7)
# dev.off()


# Raw data 
df_corr <-
  df_plt |>
  filter(df %in% "Raw") |>
  select(all_of(c(outcome, "Clay", "Sand"))) 

#rename(outcome |> setNames(outcome_labels))

# png(
#   height = 8,
#   width = 8,
#   res = 300,
#   units = "in",
#   file = file.path(fig_path, "corr-raw.png"),
#   type = "cairo"
# )

a <- cor(df_corr, use = "na.or.complete")
testRes = cor.mtest(df_corr, conf.level = 0.95)
corrplot::corrplot(a, p.mat = testRes$p, method = 'circle', type = 'lower',
         addCoef.col ='black', number.cex = 0.3, order = 'original', diag=FALSE,
         pch.cex = 0, insig = "pch", tl.col = "black", tl.cex = 0.7)
#dev.off()

Figure 29: Correlation matrix of most soil, root, and AGB variables for Clatskanie. Note, only statistically significant correlations are shown.
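
One way to quantify the redundancy flagged in Figure 29, complementing Hari's VIF suggestion, is to compute variance inflation factors by hand in base R (shown here on simulated data; the Clatskanie predictor matrix could be substituted):

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the others; values above ~5-10 flag collinearity.
vif_manual <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
}
set.seed(3)
x1 <- rnorm(100); x2 <- x1 + rnorm(100, 0, 0.2); x3 <- rnorm(100)
vifs <- vif_manual(cbind(x1, x2, x3))
round(vifs, 1)  # x1 and x2 are highly collinear, x3 is not
```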

6.1 References

Blumstein, Meghan, Anna Sala, David J. Weston, Noel Michelle Holbrook, and Robin Hopkins. 2022. “Plant Carbohydrate Storage: Intra- and Inter-Specific Trade-Offs Reveal a Major Life History Trait.” New Phytologist 235 (6): 2211–22. https://doi.org/10.1111/nph.18213.
Chatterjee, Sourav. 2021. “A New Coefficient of Correlation.” Journal of the American Statistical Association 116 (536): 2009–22. https://doi.org/10.1080/01621459.2020.1758115.
Chatterjee, Sourav, and Susan Holmes. 2023. XICOR: Robust and Generalized Correlation Coefficients. https://CRAN.R-project.org/package=XICOR.
Chhetri, Hari B., David Macaya-Sanz, David Kainer, Ajaya K. Biswal, Luke M. Evans, Jin-Gui Chen, Cassandra Collins, et al. 2019. “Multitrait Genome-Wide Association Analysis of Populus Trichocarpa Identifies Key Polymorphisms Controlling Morphological and Physiological Traits.” New Phytologist 223 (1): 293–309. https://doi.org/10.1111/nph.15777.
Cramer, J. S. 1987. “Mean and Variance of R2 in Small and Moderate Samples.” Journal of Econometrics 35 (2-3): 253–66. https://doi.org/10.1016/0304-4076(87)90027-3.
Dupont, Emiko, Simon N. Wood, and Nicole H. Augustin. 2022a. “Spatial+: A Novel Approach to Spatial Confounding.” Biometrics 78 (4): 1279–90. https://doi.org/10.1111/biom.13656.
———. 2022b. “Spatial+: A Novel Approach to Spatial Confounding.” Biometrics 78 (4): 1279–90. https://doi.org/10.1111/biom.13656.
Evans, Luke M., Gancho T. Slavov, Eli Rodgers-Melnick, Joel Martin, Priya Ranjan, Wellington Muchero, Amy M. Brunner, et al. 2014. “Population Genomics of Populus Trichocarpa Identifies Signatures of Selection and Adaptive Trait Associations.” Nature Genetics 46 (10): 1089–96. https://doi.org/10.1038/ng.3075.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.
Harman-Ware, Anne E., David Macaya-Sanz, Chanaka Roshan Abeyratne, Crissa Doeppke, Kathleen Haiby, Gerald A. Tuskan, Brian Stanton, Stephen P. DiFazio, and Mark F. Davis. 2021. “Accurate Determination of Genotypic Variance of Cell Wall Characteristics of a Populus Trichocarpa Pedigree Using High-Throughput Pyrolysis-Molecular Beam Mass Spectrometry.” Biotechnology for Biofuels 14 (1): 59. https://doi.org/10.1186/s13068-021-01908-y.
Kristy, Brandon, Alyssa A. Carrell, Eric Johnston, Jonathan R. Cumming, Dawn M. Klingeman, Kimberly Gwinn, Kimberly C. Syring, Caroline Skalla, Scott Emrich, and Melissa A. Cregger. 2022. “Chronic Drought Differentially Alters the Belowground Microbiome of Drought-Tolerant and Drought-Susceptible Genotypes of Populus Trichocarpa.” Phytobiomes Journal 6 (4): 317–30. https://doi.org/10.1094/PBIOMES-12-21-0076-R.
Sohil, Fariha, Muhammad Umair Sohail, and Javid Shabbir. 2022. “An Introduction to Statistical Learning with Applications in R: By Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, New York, Springer Science and Business Media, 2013, eISBN: 978-1-4614-7137-7.” Statistical Theory and Related Fields 6 (1): 87. https://doi.org/10.1080/24754269.2021.1980261.
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R, Second Edition. Chapman and Hall/CRC. https://doi.org/10.1201/9781315370279.